Professional Assembly Language

Table of Contents

Chapter 1: What Is Assembly Language

Processor Instructions

The memory bytes that contain the instruction codes are no different than the bytes that contain the data used by the processor.

To differentiate between data and instruction codes, special pointers are used to help the processor keep track of where in memory the data and instruction codes are stored.

High-Level Languages

  • Types of high-level languages

    They can all be classified into two different categories, based on how they are run on the computer:

    • Compiled languages
    • Interpreted languages

Chapter 2: The IA-32 Platform

Often, the whole point of using assembly language is to exploit low-level features of the processor within your application program. Knowing what elements can be used to assist your programs in gaining the most speed possible can mean the difference between a fast application and a slow application.

The hardware and instruction code set designed for the Pentium processors is commonly referred to as the IA-32 platform.

The processor contains the hardware and instruction codes that control the operation of the computer. It is connected to the other elements of the computer (the memory storage unit, input devices, and output devices) using three separate buses: a control bus, an address bus, and a data bus.

The main components in the processor are as follows:

  • Control unit
  • Execution unit
  • Registers
  • Flags

Control unit

  1. Retrieve instructions from memory.
  2. Decode instructions for operation.
  3. Retrieve data from memory as needed.
  4. Store the results as necessary.

Because it takes considerably longer to retrieve data from memory than to process it, a backlog occurs, whereby the processor is continually waiting for instructions and data to be retrieved from memory. To solve this problem, the concept of prefetching was created.

To incorporate prefetching, a special storage area is needed on the processor chip itself — one that the processor can access more quickly than normal memory. This problem was solved using cache memory.

The IA-32 platform implements caching with two (or more) layers of cache. The first cache layer (called L1) attempts to prefetch both instruction code and data from memory as it thinks it will be needed by the processor. As the instruction pointer moves along in memory, the prefetch algorithm determines which instruction codes should be read and placed in the cache. In a similar manner, if data is being processed from memory, the prefetch algorithm attempts to determine what data elements may be accessed next and also reads them from memory and places them in cache.

Of course, one pitfall to caching instructions and data is that there is no guarantee that the program will execute instructions in a sequential order.

To help alleviate this problem, a second cache layer was created. The second cache layer (called L2) can also hold instruction code and data elements, separate from the first cache layer. When the program logic jumps to a completely different area in memory to execute instructions, the second layer cache can still hold instructions from the previous instruction location. If the program logic jumps back to the area, those instructions are still being cached and can be processed almost as quickly as instructions stored in the first layer cache.

  • Branch prediction unit

    While implementing multiple layers of cache is one way to help speed up processing of program logic, it still does not solve the problem of “jumpy” programs.

    To help solve this problem, the IA-32 platform processors also incorporate branch prediction. Branch prediction uses specialized algorithms to attempt to predict which instruction codes will be needed next within a program branch.

    Special statistical algorithms and analysis are incorporated to determine the most likely path traveled through the instruction code. Instruction codes along that path are prefetched and loaded into the cache layers.

    The Pentium 4 processor utilizes three techniques to implement branch prediction:

    • Deep branch prediction
    • Dynamic data flow analysis
    • Speculative execution
  • Out-of-order execution engine

    This is where instructions are prepared for processing by the execution unit. It contains several buffers to change the order of instructions within the pipeline to increase the performance of the control unit.

    Instructions retrieved from the prefetch and decoding pipeline are analyzed and reordered, enabling them to be executed as quickly as possible. By analyzing a large number of instructions, the out-of-order execution engine can find independent instructions that can be executed (and their results saved) until required by the rest of the program.

Execution unit

The main function of the processor is to execute instructions.

Registers

To help solve this problem, the processor includes internal memory locations, called registers. The registers are capable of storing data elements for processing without having to access the memory storage unit. The downside to registers is that a limited number of them are built into the processor chip.

The core groups of registers available to all processors in the IA-32 family are shown in the following table.

  • General purpose: Eight 32-bit registers used for storing working data
  • Segment: Six 16-bit registers used for handling memory access
  • Instruction pointer: A single 32-bit register pointing to the next instruction code to execute
  • Floating-point data: Eight 80-bit registers used for floating-point arithmetic data
  • Control: Five 32-bit registers used to determine the operating mode of the processor
  • Debug: Eight 32-bit registers used to contain information when debugging the processor
  • General-purpose registers

    The general-purpose registers are used to temporarily store data as it is processed on the processor.

  • Segment registers

    The segment registers are used specifically for referencing memory locations. The IA-32 processor platform allows three different methods of accessing system memory:

    • Flat memory model
    • Segmented memory model
    • Real-address mode

    The flat memory model presents all system memory as a contiguous address space. All instructions, data, and the stack are contained in the same address space. Each memory location is accessed by a specific address, called a linear address.

    The segmented memory model divides the system memory into groups of independent segments, referenced by pointers located in the segment registers. Each segment is used to contain a specific type of data. One segment is used to contain instruction codes, another data elements, and a third the program stack.

    Memory locations in segments are defined by logical addresses. A logical address consists of a segment address and an offset address. The processor translates a logical address to a corresponding linear address location to access the byte of memory.

    If a program is using the real address mode, all of the segment registers point to the zero linear address, and are not changed by the program. All instruction codes, data elements, and stack elements are accessed directly by their linear address.

  • Instruction pointer register

    The instruction pointer register (or EIP register), sometimes called the program counter, keeps track of the next instruction code to execute.

  • Control registers

    The five control registers are used to determine the operating mode of the processor and the characteristics of the currently executing task.

  • Flags

    For each operation that is performed in the processor, there must be a mechanism to determine whether the operation was successful or not.

Advanced IA-32 Features

  • The x87 floating-point unit

    To support these functions, additional instruction codes as well as additional registers and execution units were required. Together these elements are referred to as the x87 floating-point unit (FPU).

  • Multimedia extensions (MMX)

    MMX was the first technology to support the Intel Single Instruction, Multiple Data (SIMD) execution model.

    The SIMD model was developed to process larger numbers, commonly found in multimedia applications. The SIMD model uses expanded register sizes and new number formats to speed up the complex number crunching required for real-time multimedia presentations.

    The MMX environment includes three new packed integer data types that can be handled by the processor:

    • 64-bit packed byte integers
    • 64-bit packed word integers
    • 64-bit packed doubleword integers

  • Streaming SIMD extensions (SSE)

    SSE enhances performance for complex floating-point arithmetic, often used in 3-D graphics, motion video, and video conferencing.

    The second implementation of SSE (SSE2) in the Pentium 4 processors incorporates the same XMM registers that SSE uses, and also introduces five new data types:

    • 128-bit packed double-precision floating-point
    • 128-bit packed byte integers
    • 128-bit packed word integers
    • 128-bit packed doubleword integers
    • 128-bit packed quadword integers

    A third implementation of SSE (SSE3) does not create any new data types, but provides several new instructions for processing both integer and floating-point values in the XMM registers.

  • Hyperthreading

    One of the most exciting features added to the Pentium 4 processor line is hyperthreading. Hyperthreading enables a single IA-32 processor to handle multiple program execution threads simultaneously.

Chapter 3: The Tools of the Trade

The Development Tools

  • The assembler
    • HLA

      The High Level Assembler (HLA) is the creation of Professor Randall Hyde. It creates Intel instruction code applications on DOS, Windows, and Linux operating systems.

      The HLA Web site is located at http://webster.cs.ucr.edu. Professor Hyde uses this Web site as a clearinghouse for various assembler information.

  • The linker

    However, most assemblers do not automatically link the object code to produce the executable program file. Instead, a second manual step is required to link the assembly language object code with other libraries and produce an executable program file that can be run on the host operating system. This is the job of the linker.

  • The debugger

    Similar to assemblers, debuggers are specific to the operating system and hardware platform for which the program was written. The debugger must know the instruction code set of the hardware platform, and understand the registers and memory handling methods of the operating system.

    Most debuggers provide four basic functions to the programmer:

    • Running the program in a controlled environment, specifying any runtime parameters required
    • Stopping the program at any point within the program
    • Examining data elements, such as memory locations and registers
    • Changing elements in the program while it is running, to facilitate bug removal
  • The compiler
  • The object code disassembler

    The GNU compiler enables you to view the generated assembly language code before it is assembled, but what about after the object file is already created?

    A disassembler program takes either a full executable program or an object code file and displays the instruction codes that will be run by the processor. Some disassemblers even take the process one step further by converting the instruction codes into easily readable assembly language syntax.

  • The profiler

    To determine how much processing time each function is taking, you must have a profiler in your toolkit. The profiler is able to track how much processor time is spent in each function as it is used during the course of the program execution.

The GNU Assembler

The GNU assembler program (called gas) is the most popular assembler for the UNIX environment.

  • Using the assembler

    One oddity about the assembler is that although it is called gas, the command-line executable program is called as.

  • A word about opcode syntax

    One of the more confusing parts of the GNU assembler is the syntax it uses for representing assembly language code in the source code file. The original developers of gas chose to implement AT&T opcode syntax for the assembler.

The GNU Linker

The GNU linker, ld, is used to link object code files into either executable program files or library files.

The GNU Compiler

The GNU Compiler Collection (gcc) is the most popular development system for UNIX systems.

The GNU Objdump Program

The GNU objdump program is another utility found in the binutils package that can be of great use to programmers. Often it is necessary to view the instruction codes generated by the compiler in the object code files. The objdump program will display not only the assembly language code, but the raw instruction codes generated as well.

The GNU Profiler Program

The GNU profiler (gprof) is another program included in the binutils package. This program is used to analyze program execution and determine where “hot spots” are in the application.

Chapter 4: A Sample Assembly Language Program

The Parts of a Program

The three most commonly used sections are as follows:

  • The data section
  • The bss section
  • The text section

  • Defining sections
    .section .data
    .section .bss
    .section .text
    
  • Defining the starting point

    To solve this problem, the GNU assembler declares a default label, or identifier, that should be used for the entry point of the application. The _start label is used to indicate the instruction from which the program should start running.

    Besides declaring the starting label in the application, you also need to make the entry point available for external applications. This is done with the .globl directive.

    The .globl directive declares program labels that are accessible from external programs. If you are writing a bunch of utilities that are being used by external assembly or C language programs, each function section label should be declared with a .globl directive.

    .section .data
    <initialized data here>
    .section .bss
    <uninitialized data here>
    .section .text
    .globl _start
    _start:
    <instruction code goes here>
    

Creating a Simple Program

This process replaces the x’s that were used as placeholders with the actual Vendor ID string pieces (note that the Vendor ID string was divided into the registers in the strange order EBX, EDX, and ECX)

The Linux write system call is used to write bytes to a file. Following are the parameters for the write system call:

  • EAX contains the system call value.
  • EBX contains the file descriptor to write to.
  • ECX contains the start of the string.
  • EDX contains the length of the string.

  • Building the executable
    $ as -o cpuid.o cpuid.s
    $ ld -o cpuid cpuid.o
    
    #or change _start to main
    $ gcc -o cpuid cpuid.s
    
    #debug
    $ as -gstabs -o cpuid.o cpuid.s
    $ ld -o cpuid cpuid.o
    

Linking with C library functions

$ ld -dynamic-linker /lib/ld-linux.so.2 -o cpuid2 -lc cpuid2.o

Chapter 5: Moving Data

Defining Data Elements

  • The data section

    The following table shows the different directives that can be used to reserve memory for specific types of data elements.

    .ascii Text string
    .asciz Null-terminated text string
    .byte Byte value
    .double Double-precision floating-point number
    .float Single-precision floating-point number
    .int 32-bit integer number
    .long 32-bit integer number (same as .int)
    .octa 16-byte integer number
    .quad 8-byte integer number
    .short 16-bit integer number
    .single Single-precision floating-point number (same as .float)
    

    After the directive is declared, a default value (or values) must be defined. This sets the data in the reserved memory location to the specific values.

    output:
    .ascii "The processor Vendor ID is 'xxxxxxxxxxxx'\n"
    
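    As a sketch, the other directives combine the same way; the label names below (msg, height, sizes) are invented for illustration:

```asm
.section .data
msg:
    .asciz "Hello, world!\n"     # null-terminated text string
height:
    .int 100                     # one 32-bit integer
area:
    .float 17.5                  # single-precision floating-point value
sizes:
    .byte 10, 20, 30, 40         # four consecutive byte values
```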
  • Defining static symbols

    The .equ directive is used to set a constant value to a symbol that can be used in the text section, as shown in the following examples:

    .equ factor, 3
    .equ LINUX_SYS_CALL, 0x80
    
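    Once defined, the constant is referenced in the text section with a dollar sign, like any immediate value. A minimal exit-program sketch:

```asm
.equ LINUX_SYS_CALL, 0x80
.section .text
.globl _start
_start:
    movl $1, %eax            # exit system call number
    movl $0, %ebx            # exit status
    int $LINUX_SYS_CALL      # invoke the kernel via the constant
```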
  • The bss section

    Instead of declaring specific data types, you just declare raw segments of memory that are reserved for whatever purpose you need them for.

    .comm Declares a common memory area for data that is not initialized
    .lcomm Declares a local common memory area for data that is not initialized
    

    While the two directives work similarly, the local common memory area is reserved for data that will not be accessed outside of the local assembly code. The format for both directives is:

    .comm symbol, length

    .section .bss
    .lcomm buffer, 10000
    

Moving Data Elements

  • The MOV instruction formats

    movx source, destination

    where x can be the following:
    • l for a 32-bit long word value
    • w for a 16-bit word value
    • b for an 8-bit byte value
    
  • Moving immediate data to registers and memory
    movl $0, %eax
    movl $0x80, %ebx
    movl $100, height
    # moves the value 0 to the EAX register
    # moves the hexadecimal value 80 to the EBX register
    # moves the value 100 to the height memory location
    
  • Moving data between registers
    movl %eax, %ecx
    movw %ax, %cx
    # move 32-bits of data from the EAX register to the ECX register
    # move 16-bits of data from the AX register to the CX register
    
  • Moving data between memory and registers
    • Moving data values from memory to a register

      movl value, %eax

    • Moving data values from a register to memory

      movl %ecx, value

    • Using indexed memory locations

      The way this is done is called indexed memory mode. The memory location is determined by the following:

      • A base address
      • An offset address to add to the base address
      • The size of the data element
      • An index to determine which data element to select

      The format of the expression is:

      base_address(offset_address, index, size)

      The data value retrieved is located at base_address + offset_address + index * size.
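      As a sketch of indexed memory mode (the values label and data are invented for illustration), this fragment selects the third element of an array of 4-byte integers:

```asm
.section .data
values:
    .int 10, 15, 20, 25, 30            # an array of 32-bit integers
.section .text
.globl _start
_start:
    movl $2, %edi                      # index of the element to fetch
    movl values(, %edi, 4), %ebx       # EBX = values[2] = 20 (base + index*4)
    movl $1, %eax                      # exit, with the value as the status
    int $0x80
```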

    • Using indirect addressing with registers

      Besides holding data, registers can also be used to hold memory addresses. When a register holds a memory address, it is referred to as a pointer. Accessing the data stored in the memory location using the pointer is called indirect addressing.

      While using a label references the data value contained in the memory location, you can get the memory location address of the data value by placing a dollar sign ($) in front of the label in the instruction. Thus the instruction movl $values, %edi moves the memory address of the values label into the EDI register.

      The next instruction in the cpuid.s program: movl %ebx, (%edi) is the other half of the indirect addressing mode. Without the parentheses around the EDI register, the instruction would just load the value in the EBX register to the EDI register. With the parentheses around the EDI register, the instruction instead moves the value in the EBX register to the memory location contained in the EDI register.

      Rather than adding the offset value to the register itself, you place the value outside of the parentheses, like so:

      movl %edx, 4(%edi) This instruction places the value contained in the EDX register in the memory location 4 bytes after the location pointed to by the EDI register. You can also go in the opposite direction:

      movl %edx, -4(%edi) This instruction places the value in the memory location 4 bytes before the location pointed to by the EDI register.

Conditional Move Instructions

The conditional move instructions are one such processor enhancement, available starting in the P6 family of Pentium processors.

cmovx source, destination

where x is a one- or two-letter code denoting the condition that will trigger the move action. The conditions are based on the current values in the EFLAGS register. The specific bits that are used by the conditional move instructions are shown in the following table.

CF Carry flag A mathematical expression has created a carry or borrow
OF Overflow flag An integer value is either too large or too small
PF Parity flag The low-order byte of the result contains an even number of 1 bits
SF Sign flag Indicates whether the result is negative or positive
ZF Zero flag The result of the mathematical operation is zero

The following table shows the unsigned conditional move instructions.

CMOVA/CMOVNBE Above/not below or equal (CF or ZF) = 0
CMOVAE/CMOVNB Above or equal/not below CF=0
CMOVNC Not carry CF=0
CMOVB/CMOVNAE Below/not above or equal CF=1
CMOVC Carry CF=1
CMOVBE/CMOVNA Below or equal/not above (CF or ZF) = 1
CMOVE/CMOVZ Equal/zero ZF=1
CMOVNE/CMOVNZ Not equal/not zero ZF=0
CMOVP/CMOVPE Parity/parity even PF=1
CMOVNP/CMOVPO Not parity/parity odd PF=0
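A common use of these instructions is selecting a value without a branch. This sketch keeps the larger of two unsigned values in EBX:

```asm
    movl $50, %ebx
    movl $75, %ecx
    cmpl %ebx, %ecx        # computes ECX - EBX and sets EFLAGS
    cmova %ecx, %ebx       # if ECX was above (unsigned), EBX = ECX
    # EBX now holds 75
```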

Exchanging Data

The instructions are described in the following table.

XCHG Exchanges the values of two registers, or a register and a memory location
BSWAP Reverses the byte order in a 32-bit register
XADD Exchanges two values and stores the sum in the destination operand
CMPXCHG Compares a value with an external value and exchanges it with another
CMPXCHG8B Compares two 64-bit values and exchanges it with another
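For instance, BSWAP is handy for converting between little-endian and big-endian representations. A brief sketch:

```asm
    movl $0x12345678, %ebx
    bswap %ebx              # EBX becomes 0x78563412 (byte order reversed)
    xchgl %eax, %ebx        # swap the contents of EAX and EBX
```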

The Stack

  • How the stack works

    Unlike program data, which is placed at the start of its memory area and grows upward, the stack behaves just the opposite. The stack is reserved at the end of the memory area, and as data is placed on the stack, it grows downward.
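    A short sketch of the last-in, first-out behavior using PUSH and POP:

```asm
    movl $10, %ecx
    pushl %ecx              # ESP decreases by 4; the value 10 is at the new top
    pushl $100              # immediate values can be pushed as well
    popl %eax               # EAX = 100 (the last value pushed comes off first)
    popl %ebx               # EBX = 10; ESP is back where it started
```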

  • PUSHing and POPing all the registers
    PUSHA/POPA Push or pop all of the 16-bit general-purpose registers
    PUSHAD/POPAD Push or pop all of the 32-bit general-purpose registers
    PUSHF/POPF Push or pop the lower 16 bits of the EFLAGS register
    PUSHFD/POPFD Push or pop the entire 32 bits of the EFLAGS register
    
  • Optimizing Memory Access

    To solve this problem, Intel suggests following these rules when defining data:

    • Align 16-bit data on a 16-byte boundary.
    • Align 32-bit data so that its base address is a multiple of four.
    • Align 64-bit data so that its base address is a multiple of eight.
    • Avoid many small data transfers. Instead, use a single large data transfer.
    • Avoid using larger data sizes (such as 80- and 128-bit floating-point values) in the stack

Chapter 6: Controlling Execution Flow

Unconditional Branches

When an unconditional branch is encountered in the program, the instruction pointer is automatically routed to a different location. You can use three types of unconditional branches:

  • Jumps
  • Calls
  • Interrupts

Conditional Branches

The result of the conditional branch depends on the state of the EFLAGS register at the time the branch is executed.

There are many bits in the EFLAGS register, but the conditional branches are only concerned with five of them:

  • Carry flag (CF) - bit 0 (least significant bit)
  • Overflow flag (OF) - bit 11
  • Parity flag (PF) - bit 2
  • Sign flag (SF) - bit 7
  • Zero flag (ZF) - bit 6

The following table describes all of the conditional jump instructions available.

JA Jump if above CF=0 and ZF=0
JAE Jump if above or equal CF=0
JB Jump if below CF=1
JBE Jump if below or equal CF=1 or ZF=1
JC Jump if carry CF=1
JCXZ Jump if CX register is 0
JECXZ Jump if ECX register is 0
JE Jump if equal ZF=1
JG Jump if greater ZF=0 and SF=OF
JGE Jump if greater or equal SF=OF
JL Jump if less SF<>OF
JLE Jump if less or equal ZF=1 or SF<>OF
JNA Jump if not above CF=1 or ZF=1
JNAE Jump if not above or equal CF=1
JNB Jump if not below CF=0
JNBE Jump if not below or equal CF=0 and ZF=0
JNC Jump if not carry CF=0
JNE Jump if not equal ZF=0
JNG Jump if not greater ZF=1 or SF<>OF
JNGE Jump if not greater or equal SF<>OF
JNL Jump if not less SF=OF
JNLE Jump if not less or equal ZF=0 and SF=OF
JNO Jump if not overflow OF=0
JNP Jump if not parity PF=0
JNS Jump if not sign SF=0
JNZ Jump if not zero ZF=0
JO Jump if overflow OF=1
JP Jump if parity PF=1
JPE Jump if parity even PF=1
JPO Jump if parity odd PF=0
JS Jump if sign SF=1
JZ Jump if zero ZF=1
  • The compare instruction

    The format of the CMP instruction is as follows: cmp operand1, operand2 The CMP instruction compares the second operand with the first operand. It performs a subtraction operation on the two operands behind the scenes (operand2 – operand1). Neither of the operands is modified, but the EFLAGS register is set as if the subtraction took place.

    Unlike the other flags, there are instructions that can specifically modify the carry flag. These are described in the following table.

    CLC Clear the carry flag (set it to zero)
    CMC Complement the carry flag (change it to the opposite of what is set)
    STC Set the carry flag (set it to one)
    
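    A minimal sketch of CMP paired with a conditional jump (the overlimit and done labels are invented for illustration):

```asm
    movl $20, %eax
    cmpl $25, %eax          # evaluates EAX - 25 and sets EFLAGS
    jge overlimit           # branch taken only if EAX >= 25 (signed)
    movl $0, %ebx           # fall-through path: EAX was below 25
    jmp done
overlimit:
    movl $1, %ebx
done:
```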

Loops

LOOP Loop until the ECX register is zero
LOOPE/LOOPZ Loop until either the ECX register is zero, or the ZF flag is not set
LOOPNE/LOOPNZ Loop until either the ECX register is zero, or the ZF flag is set
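As a sketch, the LOOP instruction makes a countdown loop compact; this fragment sums the integers 10 down to 1:

```asm
    movl $10, %ecx          # loop counter
    movl $0, %eax           # running sum
sum_loop:
    addl %ecx, %eax         # EAX += ECX
    loop sum_loop           # decrement ECX, branch while ECX != 0
                            # afterward EAX holds 10+9+...+1 = 55
```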

Optimizing Branch Instructions

Chapter 7: Using Numbers

Using Numbers

The core numeric data types are as follows:

  • Unsigned integers
  • Signed integers
  • Binary-coded decimal
  • Packed binary-coded decimal
  • Single-precision floating-point
  • Double-precision floating-point
  • Double-extended floating-point

Integers

  • Standard integer sizes

    The basic IA-32 platform supports four different integer sizes:

    • Byte: 8 bits
    • Word: 16 bits
    • Doubleword: 32 bits
    • Quadword: 64 bits

  • Unsigned integers
    8 0 through 255
    16 0 through 65,535
    32 0 through 4,294,967,295
    64 0 through 18,446,744,073,709,551,615
    
  • Signed integers
    • Signed magnitude
    • One’s complement
    • Two’s complement
    

    The IA-32 platform uses the two’s complement method to represent signed integers.

    8 -128 to 127
    16 -32,768 to 32,767
    32 -2,147,483,648 to 2,147,483,647
    64 -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
    

Binary Coded Decimal

Floating-Point Numbers

  • Standard floating-point data types

    The IEEE Standard 754 floating-point standard defines real numbers as binary floating-point values using three components:

    • A sign
    • A significand
    • An exponent

    The following table sums up the three types of floating-point formats used on the standard IA-32 platform.

    Format            Total bits  Significand bits  Exponent bits  Range
    Single precision  32          24                8              1.18 x 10^-38 to 3.40 x 10^38
    Double precision  64          53                11             2.23 x 10^-308 to 1.79 x 10^308
    Double extended   80          64                15             3.37 x 10^-4932 to 1.18 x 10^4932
    

    The IA-32 FLD instruction is used for loading single- and double-precision floating-point numbers stored in memory onto the FPU register stack. To differentiate between the data sizes, the GNU assembler uses the FLDS instruction for loading single-precision floating-point numbers, and the FLDL instruction for loading double-precision floating-point numbers.

    Similarly, the FST instruction is used for retrieving the top value on the FPU register stack and placing the value in a memory location. Again, for single-precision numbers, the instruction is FSTS, and for double-precision numbers, FSTL.
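    A brief sketch of loading and storing FPU values (the labels are invented for illustration); note that because the FPU works internally in extended precision, the FLDS/FSTL pair also converts the value from single to double precision:

```asm
.section .data
value1:
    .float 3.25             # single-precision source value
result:
    .double 0.0             # double-precision destination
.section .text
    flds value1             # push the 32-bit float onto the FPU stack (ST0)
    fstl result             # store ST0 to memory as a 64-bit double
```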

  • Using preset floating-point values
    FLD1 Push +1.0 onto the FPU stack
    FLDL2T Push log(base 2) 10 onto the FPU stack
    FLDL2E Push log(base 2) e onto the FPU stack
    FLDPI Push the value of pi onto the FPU stack
    FLDLG2 Push log(base 10) 2 onto the FPU stack
    FLDLN2 Push log(base e) 2 onto the FPU stack
    FLDZ Push +0.0 onto the FPU stack
    
  • SSE floating-point data types

    The following two new 128-bit floating-point data types are available:

    • 128-bit packed single-precision floating-point (in SSE)
    • 128-bit packed double-precision floating-point (in SSE2)

  • Moving SSE floating-point values
    MOVAPS Move four aligned, packed single-precision values to XMM
          registers or memory
    MOVUPS Move four unaligned, packed single-precision values to XMM
          registers or memory
    MOVSS Move a single-precision value to memory or the low doubleword
         of a register
    MOVLPS Move two single-precision values to memory or the low
          quadword of a register
    MOVHPS Move two single-precision values to memory or the high
          quadword of a register
    MOVLHPS Move two single-precision values from the low quadword to
           the high quadword
    MOVHLPS Move two single-precision values from the high quadword to
           the low quadword
    
  • SSE2 floating-point values
    MOVAPD Move two aligned, double-precision values to XMM registers
          or memory
    MOVUPD Move two unaligned, double-precision values to XMM registers
          or memory
    MOVSD Move one double-precision value to memory or the low
         quadword of a register
    MOVHPD Move one double-precision value to memory or the high
          quadword of a register
    MOVLPD Move one double-precision value to memory or the low
          quadword of a register
    

Chapter 8: Basic Math Functions

Integer Arithmetic

  • Addition

    add source, destination

    where source can be an immediate value, a memory location, or a register. The destination parameter can be either a register or a value stored in a memory location (although you cannot use a memory location for both the source and destination at the same time). The result of the addition is placed in the destination location.

    The ADD instruction can add 8-, 16-, or 32-bit values. As with other GNU assembler instructions, you must specify the size of the operands by adding a b (for byte), w (for word), or l (for doubleword) to the end of the ADD mnemonic.

    The ADC instruction can be used to add two unsigned or signed integer values, along with the value contained in the carry flag from a previous ADD instruction. To add multiple groups of bytes, you can chain together multiple ADC instructions, as the ADC instruction also sets the carry and overflow flags as appropriate for the operation.

    The format of the ADC instruction is adc source, destination where source can be an immediate value or an 8-, 16-, or 32-bit register or memory location value, and destination can be an 8-, 16-, or 32-bit register or memory location value. (Similar to the ADD instruction, source and destination cannot both be memory locations at one time.)
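    Chaining ADD and ADC this way allows 64-bit addition using only 32-bit registers. A sketch (the data labels and values are invented for illustration):

```asm
.section .data
data1:
    .quad 0x00000001ffffffff   # 64-bit value (low doubleword first in memory)
data2:
    .quad 0x0000000000000001
.section .text
    movl data1, %eax           # low 32 bits of the first operand
    movl data1+4, %ebx         # high 32 bits
    addl data2, %eax           # add the low halves; sets the carry flag
    adcl data2+4, %ebx         # add the high halves plus the carry
    # EBX:EAX now holds the 64-bit sum, 0x0000000200000000
```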

  • Subtraction

    The format of the SUB instruction is sub source, destination where the source value is subtracted from the destination value, with the result stored in the destination operand location. The source and destination operands can be 8-, 16-, or 32-bit registers or values stored in memory (but again, they cannot both be memory locations at the same time). The source value can also be an immediate data value.

    Just like in addition, you can use the carry condition to your advantage to subtract large signed integer values. The SBB instruction utilizes the carry and overflow flags in multibyte subtractions to implement the borrow feature across data boundaries. The format of the SBB instruction is sbb source, destination where the carry bit is added to the source value, and the result is subtracted from the destination value. The result is stored in the destination location.

  • Incrementing and decrementing

    The format of the instructions is dec destination inc destination

  • Multiplication

    The format for the MUL instruction is mul source where source can be an 8-, 16-, or 32-bit register or memory value. You might be wondering how you can multiply two values by only supplying one operand in the instruction line. The answer is that the destination operand is implied.

    Working with the implied destination operand is somewhat complicated. For one thing, the destination location always uses some form of the EAX register, depending on the size of the source operand. Thus, one of the operands used in the multiplication must be placed in the AL, AX, or EAX registers, depending on the size of the value.

    Unfortunately, when multiplying a 16-bit source operand, the EAX register is not used to hold the 32-bit result. In order to be backwardly compatible with older processors, Intel uses the DX:AX register pair to hold the 32-bit multiplication result value (this format started back in the 16-bit processor days). The high-order word of the result is stored in the DX register, while the low-order word is stored in the AX register.

    For 32-bit source values, the 64-bit EDX:EAX register pair is used, again with the high-order doubleword in the EDX register and the low-order doubleword in EAX. Make sure that if you have data stored in the EDX (or DX) register, you save it elsewhere before using the 16- or 32-bit versions of MUL.

    Source size   Implied operand   Result location
    8 bits        AL                AX
    16 bits       AX                DX:AX
    32 bits       EAX               EDX:EAX
    

    The first format of the IMUL instruction takes one operand, and behaves exactly the same as the MUL instruction: imul source

    The second format of the IMUL instruction enables you to specify a destination operand other than the EAX register: imul source, destination where source can be a 16- or 32-bit register or value in memory, and destination must be a 16- or 32-bit general-purpose register. This format enables you to specify where the result of the multiplication will go (instead of being forced to use the AX and DX registers).

    The third format of the IMUL instruction enables you to specify three operands: imul multiplier, source, destination where multiplier is an immediate value, source is a 16- or 32-bit register or value in memory, and destination must be a general-purpose register. This format enables you to perform a quick multiplication of a value (the source) with a signed integer (the multiplier), storing the result in a general-purpose register (the destination).

  • Division
    • Unsigned division

      The DIV instruction is used for dividing unsigned integers. The format of the DIV instruction is div divisor

      where divisor is the value that is divided into the implied dividend, and can be an 8-, 16-, or 32-bit register or value in memory. The dividend must already be stored in the AX register (for a 16-bit value), the DX:AX register pair (for a 32-bit value), or the EDX:EAX register pair (for a 64-bit value) before the DIV instruction is performed.

      Dividend    Dividend size   Quotient   Remainder
      AX          16 bits         AL         AH
      DX:AX       32 bits         AX         DX
      EDX:EAX     64 bits         EAX        EDX
      
    • Signed division

      The IDIV instruction is used exactly like the DIV instruction, but for dividing signed integers. It too uses an implied dividend, located in the AX register, the DX:AX register pair, or the EDX:EAX register pair. Unlike the IMUL instruction, there is only one format for the IDIV instruction, which specifies the divisor used in the division: idiv divisor

  • Shift Instructions
    • Multiply by shifting

      To multiply integers by a power of 2, you must shift the value to the left. Two instructions can be used to left shift integer values, SAL (shift arithmetic left) and SHL (shift logical left). Both of these instructions perform the same operation, and are interchangeable. They have three different formats:

      sal destination
      sal %cl, destination
      sal shifter, destination
      

      The first format shifts the destination value left one position, which is the equivalent of multiplying the value by 2.

      The second format shifts the destination value left by the number of times specified in the CL register.

      The final version shifts the destination value left the number of times indicated by the shifter value.

    • Dividing by shifting

      The SHR instruction clears the bits emptied by the shift, which makes it useful only for shifting unsigned integers. The SAR instruction either clears or sets the bits emptied by the shift, depending on the sign bit of the integer. For negative numbers, the bits are set to 1, but for positive numbers, they are cleared to zero.

    • Rotating bits
      ROL Rotate value left
      ROR Rotate value right
      RCL Rotate left and include carry flag
      RCR Rotate right and include carry flag
      
  • Logical Operations
    • Boolean logic

      When working with binary numbers, it is handy to have the standard Boolean logic functions available. Four Boolean logic operations are provided: AND, NOT, OR, and XOR.

      The AND, OR, and XOR instructions use the same format: and source, destination

    • Bit testing

      The format of the TEST instruction is the same as for the AND instruction. Even though no data is written to the destination location, you still must specify any immediate values as the source value. This is similar to how the CMP instruction works like the SUB instruction, but it does not store the result anywhere.

      As mentioned, the most common use of the TEST instruction is to check for flags in the EFLAGS register.

Chapter 15: Optimizing Routines

However, just writing functions in assembly language code instead of C or C++ does not necessarily make them perform better. Remember, the GNU compiler already converts all of your high-level language code to assembly language, so writing a function in assembly language just means that you did it instead of the compiler.

Optimized Compiler Code

The -O family of compiler options provides steps of optimization for the GNU compiler. Each step provides a higher level of optimization. There are currently three steps available for optimizing:

❑ -O: Provides a basic level of optimization
❑ -O2: Provides more advanced code optimization
❑ -O3: Provides the highest level of optimization

Each individual optimization technique can be referenced using the -f command-line option. The -O options bundle various -f options together in a single option.

  • Compiler optimization level 1

    The -f optimization functions included at this level are described in the following list:

    ❑ -fdefer-pop: This optimization technique relates to how the assembly language code behaves
    when a function finishes. Normally, input values for functions are placed on the stack and
    accessed by the function; when the function returns, the input values are still on the stack
    and are popped immediately following the function return. This option permits the compiler
    to allow input values to accumulate on the stack across function calls. The accumulated
    input values are then removed all at once with a single instruction (usually by changing the
    stack pointer to the proper value). For most operations this is perfectly legal, as input
    values for new functions are placed on top of the old input values. However, it does make
    things somewhat messy on the stack.
    
    ❑ -fthread-jumps: This optimization technique relates to how the compiler handles both
    conditional and unconditional branches in the assembly code. In some cases, one jump
    instruction may lead to another conditional branch statement. By threading jumps, the
    compiler determines the final destination between multiple jumps and redirects the first
    jump to the final destination.
    
    ❑ -fmerge-constants: With this optimization technique, the compiler attempts to merge identical
    constants. This feature can sometimes result in long compile times, as the compiler must analyze
    every constant used in the C or C++ program, comparing them with one another.
    
    ❑ -floop-optimize: By optimizing how loops are generated in the assembly language, the
    compiler can greatly increase the performance of the application. Often, programs consist
    of many loops that are large and complex. By removing variable assignments that do not
    change value within the loops, the number of instructions performed within the loop can be
    reduced, greatly improving performance. In addition, any conditional branches made to
    determine when to leave the loop are optimized to reduce the effects of the branching.
    
    ❑ -fif-conversion: Next to loops, if-then statements are the second most time-consuming
    part of an application. A simple if-then statement can generate numerous conditional
    branches in the final assembly language code. By reducing or eliminating conditional
    branches and replacing them with conditional moves, setting flags, and performing
    arithmetic tricks, the compiler can reduce the amount of time spent in the if-then
    statements.
    
    ❑ -fif-conversion2: This technique incorporates more advanced mathematical features that
    reduce the conditional branching required to implement the if-then statements.

    ❑ -fdelayed-branch: This technique attempts to reorder instructions based on instruction
    cycle times. It also attempts to move as many instructions before conditional branches as
    possible to maximize the use of the processor instruction cache.

    ❑ -fguess-branch-probability: As its name suggests, this technique attempts to determine
    the most likely outcome of conditional branches, and moves instructions accordingly,
    similar to the delayed-branch technique. Because the code placement is predicted at compile
    time, it is quite possible that compiling the same C or C++ code twice using this option
    can produce different assembly language source code, depending on what branches the
    compiler thought would be used at compile time. Because of this, many programmers prefer
    not to incorporate this feature, and specifically include the -fno-guess-branch-probability
    option to turn it off.

    ❑ -fcprop-registers: As registers are allocated to variables within functions, the compiler
    performs a second pass to reduce scheduling dependencies (two sections requiring the same
    register) and eliminate needless register copying.
    
  • Compiler optimization level 2

    The second level of code optimization (-O2) incorporates all of the optimization techniques of the first level, plus a lot of additional techniques. These techniques are related to more specific types of code, such as loops and conditional branches. If the basic assembly language code generated by the compiler does not utilize the type of code analyzed in this level, no additional optimization will be performed. The following list describes the additional -f optimization options that are attempted at this level.

    ❑ -fforce-mem: This optimization forces all variables stored in memory locations to be
    copied to registers before using them in any instructions. For variables that are only
    involved in a single instruction, this may not be much of an optimization. However, for
    variables that are involved in a lot of instructions (such as mathematical operations),
    this can be a huge optimization, as the processor can access the value in a register much
    quicker than in memory.

    ❑ -foptimize-sibling-calls: This technique deals with function calls that are related
    and/or recursive. Often, recursive function calls can be unrolled into a common string of
    instructions, rather than using branching. This enables the processor instruction cache to
    load the unrolled instructions and process them faster than if they remain as separate
    function calls requiring branching.

    ❑ -fstrength-reduce: This optimization technique performs loop optimization and eliminates
    iteration variables. Iteration variables are variables that are tied to loop counters, such
    as for-next loops that use a variable and then perform mathematical operations using the
    loop counter variable.
    ❑ -fgcse: This performs Global Common Subexpression Elimination (gcse) routines on all of
    the generated assembly language code. These optimizations attempt to analyze the generated
    assembly language code and combine common pieces, eliminating redundant code segments. It
    should be noted that the gcc instructions recommend using -fno-gcse if the code uses
    computed gotos.

    ❑ -frerun-cse-after-loop: This technique reruns the Common Subexpression Elimination
    routines after any loops have been optimized. This enables loop code to be further
    optimized after it has been unrolled.

    ❑ -fdelete-null-pointer-checks: This optimization technique scans the generated assembly
    language code for code that checks for null pointers. The compiler assumes that
    dereferencing a null pointer would halt the program. If a pointer is checked after it has
    been dereferenced, it cannot be null.
    ❑ -fexpensive-optimizations: This performs various optimization techniques that are
    expensive from a compile-time point of view but can have a positive effect on runtime
    performance.
    ❑ -fregmove: The compiler will attempt to reassign registers used in MOV instructions and
    as operands of other instructions in order to maximize the amount of register tying.

    ❑ -fschedule-insns: The compiler will attempt to reorder instructions in order to eliminate
    processor waits for data. For processors that have delays associated with floating-point
    arithmetic, this enables the processor to load other instructions while it waits for the
    floating-point results.

    ❑ -fsched-interblock: This technique enables the compiler to schedule instructions across
    blocks of instructions. This provides greater flexibility in moving instructions around to
    maximize work done during wait times.

    ❑ -fcaller-saves: This option instructs the compiler to save and restore registers around
    function calls to enable the functions to clobber register values without having to save
    and restore them. This can be a time-saver if multiple functions are called, because the
    registers are saved and restored only once, instead of within each function call.

    ❑ -fpeephole2: This option enables any machine-specific peephole optimizations.

    ❑ -freorder-blocks: This optimization technique enables blocks of instructions to be
    reordered to improve branching and code locality.

    ❑ -fstrict-aliasing: This technique enforces strict variable rules for the higher-level
    language. For C and C++ programs, it ensures that variables are not shared between data
    types. For example, an integer variable cannot use the same memory location as a
    single-precision floating-point variable.

    ❑ -funit-at-a-time: This optimization technique instructs the compiler to read the entire
    assembly language code before running the optimization routines. This enables the compiler
    to reorder non-time-sensitive code to optimize the instruction cache. However, it takes
    considerably more memory during compile time, which may be a problem for smaller machines.

    ❑ -fcse-follow-jumps: This particular Common Subexpression Elimination (cse) technique
    scans through a jump instruction looking for destination code that is not reached via any
    other means within the program. The most common example of this is the else part of
    if-then-else statements.
    
    ❑ -falign-functions: This option is used to align functions at the start of a specific
    boundary in memory. Most processors read memory in pages, and enabling an entire function's
    code to reside in a single page can improve performance. If a function crosses pages,
    another page of memory must be processed to complete the function.
    
    ❑ -falign-loops: Similar to aligning functions, loops that contain code that is processed
    multiple times can benefit from being aligned within a page boundary in memory. When the
    loop is processed, if it is contained within a single memory page, no swapping of pages is
    required for the code.

    ❑ -fcrossjumping: The process of cross-jumping transforms code to combine equivalent code
    scattered throughout the program. This saves code size, but it may not have a direct impact
    on program performance.
    
  • Compiler optimization level 3

    The highest level of optimization provided by the compiler is accessed using the -O3 option. It incorporates all of the optimization techniques listed in levels one and two, along with some very specific additional optimizations. Again, there is no guarantee that this level of optimization will improve performance of the final code. The following -f optimization options are included at this level:

    ❑ -finline-functions: Instead of creating separate assembly language code for functions,
    this optimization technique includes the function code within the code from the calling
    program. For functions that are called multiple times, the function code is duplicated for
    each function call. While this may not be good for code size, it can increase performance
    by maximizing the instruction cache code usage, instead of branching on each function call.

    ❑ -fweb: This constructs a web of pseudo-registers to hold variables. The pseudo-registers
    contain data as if they were registers, but can be optimized by the various other
    optimization techniques, such as cse and loop optimizing.

    ❑ -fgcse-after-reload: This technique performs a second gcse optimization after completely
    reloading the generated and optimized assembly language code. This helps eliminate any
    redundant sections created by the different optimization passes.
    

Author: Shi Shougang

Created: 2015-03-05 Thu 23:21

